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Abstract 

(N ■ 

We introduce a family of numerical algorithms for the solution of linear system in 
^ | higher dimensions with the matrix and right hand side given and the solution sought 

^ ■ in the tensor train format. The proposed methods are rank-adaptive and follow the 

£Nj . alternating directions framework, but in contrast to ALS methods, in each iteration a 

tensor subspace is enlarged by a set of vectors chosen similarly to the steepest descent 
^ ■ algorithm. The convergence is analysed in the presence of approximation errors and 

the geometrical convergence rate is estimated and related to the one of the steepest 
descent. The complexity of the presented algorithms is linear in the mode size and 
dimension and the convergence demonstrated in the numerical experiments is com- 
parable to the one of the DMRG-type algorithm. 

Keywords: high-dimensional problems, tensor train format, ALS, DMRG, steepest 
descent, convergence rate, superfast algorithms. 

> . 

1 Introduction 

o : 

Linear systems arising from high-dimensional problems usually can not be solved by stan- 
dard numerical algorithms. If the equation is considered in d dimensions onani x ri2 X 
. . . x grid, the number of unknowns ni ... rid scales exponentially with d, and even for 
moderate dimension d and mode sizes the numerical complexity lays far beyond the 
technical possibilities of modern workstations and parallel systems. To make the problem 
tractable, different approximations are proposed, including sparse grids p8l [3j and ten- 
sor product methods ||MlE2lE^l[Tl|| . In this paper we consider the linear system Ax = y, 
where the matrix A and right-hand-side u are given and approximate solution x is sought 
in the tensor train (TT) format. Methods based on the TT format, also known as a linear 
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tensor network, are novel and particularly interesting among all tensor product methods 
due to their robustness and simplicity. 

The numerical optimization on tensor networks was first considered in quantum physics 
community by S. White [42|, who introduces the matrix product states (MPS) formalism to 
represent the ground state of a spin system together with the density matrix renormaliza- 
tion group (DMRG) optimization scheme. The tensor train format and some computational 
methods were independently re-discovered in the papers of Oseledets and Tyrtyshnikov 
(see [30J and references therein) until the results of White et. al. were popularized in the 
numerical mathematics community by R. Schneider ||18||. The questions concerning the 
convergence properties of alternating schemes for different tensor product formats were 
immediately raised and studied. The experimental results from quantum physics show 
the notably fast convergence of DMRG for the ground state problem, i.e., finding the min- 
imal eigenstate of a system, but give no theoretical justification for this observation. The 
alternating least squares (ALS) algorithm was used in multilinear analysis for the computa- 
tion of canonical tensor decomposition since early results of Hitchcock [T7| and was known 
for its monotone but very slow convergence. For ALS there is also a lack of convergence 
estimates both in the classical papers |T6l HI, and in the recent ones, where ALS was ap- 
plied to the Tucker model |33j, tensor trains |29], hierarchical Tucker format [25] and 
high-dimensional interpolation ||34||. 

In recent papers by Uschmajew |[4Tl l35]| the local convergence of ALS is proven for 
the canonical and tensor train decompositions. This is a major theoretical breakthrough, 
which unfortunately does not immediately lead to practical algorithms due to the local 
character of convergence studied, unjustified assumptions on the structure of the Hessian, 
and very strong requirements on the accuracy of the initial guess. The convergence rate of 
ALS is difficult to estimate partly due to the complex geometrical structure of manifolds 
defined by tensor networks. This problem is now approached from several directions, and 
we might expect new results soon | |12U11 ||. 

In contrast to ALS schemes which operate on manifolds of fixed dimension, the DMRG 
algorithm changes the ranks of a tensor format. This allows to choose the ranks adap- 
tively to the desired error threshold or the accuracy of the result and develop more prac- 
tical algorithms which do not rely on a priori choice of ranks. The DMRG was adopted 
for novel tensor formats (see references above) and new problems, including adaptive 
high-dimensional interpolation | |37 || and solution of linear systems ©[TBI. The geometri- 
cal analysis, eg the convergence of the nonlinear Gauss-Seidel method, is however even 
more difficult when the dimensions of underlying manifolds are not fixed. 

Apart of working with the tensor format structure directly, like ALS and DMRG do, 
standard algorithms from numerical linear algebra can be applied with tensor approx- 
imations and other tensor arithmetics. Following this paradigm, the solution of linear 
problems in tensor product formats was addressed in p6l[Tll6|. The usual considerations 
of linear algebra can be used in this case to analyze the convergence. A first notable ex- 
ample is the method of conjugate-gradient type for the Rayleigh quotient minimization 
in higher dimensions, for which the global convergence was proven by O. Lebedeva [27J. 

We develop a framework which combines the ALS optimization steps (ranks are fixed, 
convergence estimates not yet possible) with the steps when the tensor subspaces are in- 
creased and the ranks of a tensor format grow. Choosing the new vectors in accordance 
with standard linear algebra algorithms, we recast the classical convergence estimates for 
the proposed algorithm in higher dimensions. In this paper we consider the case of sym- 
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metrical positive definite (SPD) matrices and analyze the convergence in the A-norm, i.e. 
minimize the energy function. The basis enrichment choice follows the steepest descent (SD) 
algorithm and the convergence of the resulted method is analyzed with respect to the one 
of steepest descent. We show that the basis enrichment step combined with the ALS step 
can be seen as a certain computationally cheap approximation of the DMRG step. The 
complexity of the resulted method is equal to the one of ALS and is linear in the mode 
size and dimension. Our choice of the basis enrichment appears to be very good for prac- 
tical computations, and for the considered numerical examples the proposed methods 
converge almost as fast as the DMRG algorithm. 

Summarizing the above, the proposed algorithms have (1) proven geometrical conver- 
gence with the estimated rate, (2) practical convergence compared to the one of DMRG, 
(3) numerical complexity compared to the one of ALS. 

The paper is organized as follows. 

In Section |2] we introduce the tensor train notation and necessary definitions. 

In Section |3] we introduce the basic notation for ALS and DMRG schemes. We also 
study how the modification of one TT-block affects the ALS problem for its neighbor and 
describe this in terms of the Galerkin correction method. 

In Section |4] we develop the family of steepest descent methods for the problems in 
one, two and many dimensions. The proposed methods have an inner-outer structure, 
i.e., a steepest descent step in d dimensions is followed by a steepest descent step in d — 1 
dimension, etc, cf. the interpolation algorithms [32lll3 ||. The convergence of the recursive 
algorithms in higher dimensions is analyzed using the Galerkin correction framework. 
The effect of roundoff/ approximation errors is also studied. 

Since we make no assumptions on the TT-ranks of the solution, the ranks of the vectors 
in the proposed algorithms can grow at each iteration and make the algorithm inefficient. 
In Section|5]we discuss the implementation details, in particular the steps when the tensor 
approximation is required to reduce the ranks. 

In Section[6]the model numerical experiments demonstrate the efficiency of the method 
proposed and compare it with other algorithms mentioned in the paper. 



2 Tensor train notation and definitions 

The tensor train (TT) representation of a d-dimensional tensor x = [x(ti , . . . , id)] is written 
as the following multilinear map (cf. [35]) 

X = T(X)=T(X( 1 \...,X( d '), 

x(i b ...,i d ) = xa >wi (ii)x£U(i 2 ) . . .xa^J^ (id^OX^^Cia), 

where = 1 , . . . , n k are the mode (physical) indices, oc^ = 1 , . . . , r k are the rank indices, X' k ' 
are the tensor train cores (TT-cores) and X = (X' 1 ', . . . , X' d ') denote the whole tensor train. 
Here and later we use the Einstein summation convention [10], which assumes a summa- 
tion over every pair of repeated indices. Therefore, in Eq. ([T]) we assume the summation 
over all rank indices £X k , k = 1 , . . . , d — 1 . We also imply the closed boundary conditions 
r o = I'd = 1 to make the right-hand side a scalar for each . . . ,i d ). Eq. ([T]) is written in 
the elementwise form, i.e., the equation is assumed over all free (unpaired) indices. It is 
often convenient in higher dimensions and will be used throughout the paper. 
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The indices can be written either in the subscript Xj or in brackets x(j). For the sum- 
mation, there is no difference. The subscripted indices are usually considered as row and 
column indices of a matrix, while the indices in brackets are seen as parameters. For exam- 
ple, each TT-core X (k) is considered as a parameter-dependent on i k matrix with the row 
index a k _i and the column index <x k as follows 

x (k) = [x«_ i)tXk (i k )] e c Tk - lXnkXr % x (k) (i k ) e C Tk - ,XTk . 

In our notation X^ (i k ) is a matrix, for which standard algorithms like orthogonalization 
(QR) and singular value decomposition (SVD) can be applied. We will freely transfer in- 
dices from subscripts to brackets in order to make the equations easier to read or to empha- 
size a certain transposition of elements in tensors. It brings the notations in consistence 
with previous papers on the numerical tensor methods, e.g. |[T8ll9ll8ll35| and others. 

We will reshape arrays into matrices and vectors by using the index grouping, i.e., com- 
bining two or more indices a, . . . , C in a single multi-index a ... C- Following | |35 || we define 
interface matrices X a e C ni - nkXTk and X >k e C TkXnk +'- nd as follows 

X< k (ijW^ a k ) = X£J(ii)Xg^(i 2 ) . . .Xg_ 1)a Ji k ), 

x> k (a k , i k+1 . . . wQ = x££ +1 (i k+1 ) . . . xS^;^ (i^ )x£>_, (id), 

and similarly for symbols X <k and X^ k . Using the T notation defined in <Q) we can write 
x = x(X^ k , X >k ). For a tensor x = [x(ii , . . . , x d )] we also define the unfolding matrix, which 
consists of the entries of the original tensor as follows 

X W ( W^, i k+1 . . . i d ) = X(i! , . . . , id), X W G C n — ^n fc , 1 ...n d< 

For x in the TT-format © it holds X {k} = X^ k X >k and therefore rankX {k} = r k . In EDI 
the reverse is proven: for any tensor x there exists the representation ([T]) with TT-ranks 
r k = rankX^. This gives the term TT-rank the definite algebraic meaning. As a result, 
the tensor train representation of fixed TT-ranks yields a closed manifold, and the rank- 
(ti , r d _i ) approximation problem is well-posed. We can also approximate a given ten- 
sor by a tensor train with quasi-optimal ranks using a simple and robust approximation 
(rank truncation, or tensor rounding) algorithm ||30||. This is the case for all tensor networks 
without cycles, eg. Tucker [40J, HT ||15||, QTT-Tucker [8], etc. In contrast, the MPS formal- 
ism originally assumes the periodic boundary conditions ceo = a,d and sum over these indices, 
which leads to Tr(X' ] ' . . . X' d '), where all matrices can be shifted in cycle under the trace. 
The optimization in such type of tensor networks is difficult, because they form unclosed 
manifolds and the best approximation does not always exist. 

The tensor train representation of the matrix is made similarly with the TT-cores de- 
pending on two parameters i k , j k . Hence, x = t(X) is sought in the form ((T) and A and u 
given in the TT-format as follows 

A(i 1 ,...,i d ; ji,...,j d ) = A (1) (ii,ji)...A (d) (i d ,jd), 
u(i 1 ,...,i d )=Y( 1 Hii)...Y( d Hid). 

For A and x given in the TT-format, the matrix-vector product c = Ax is also a TT- 
format computed as follows 

c(i b ...,i d ) = (A (1) (ibji)®X (1) (ji))...(A (d) (id,jd)®X (d) (j d )), 
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where <g> denotes the tensor (Kronecker) product of two matrices defined as follows 

A= [A(i,j)], B= [B(p,q)], C = A®B= [C(ip~,Jq~)] = [A(i, j)B(p, q)] . 

We refer to [30J for more details on basic tensor operations in the TT-format. 

In this paper we will use standard I2 scalar product ( • , • ) and the A-scalar product ( • , • ) a 
defined by a symmetrical positive definite (SPD) matrix A as follows 

(U, V) A = U* AV, ||u|| \ = (u, U) a- 

For a given nonsingular matrix U we define the A-orthogonal projector Ru as follows: for 
all v and all w G span U it holds 

RuvGspanU, (w, R u v) A = (w, v) A , R u = U(U*AU) _1 U*A. 

We will use vector notations for mode indices i = . . . ,id) and rank indices r = 
(to, . . . , r d ). We also denote the subspace of tensor trains X = (X (1) , . . . , X td ') with tensor 
ranks r as 

T r = x c n - lXniXTi . 

i=1 



3 Alternating minimization methods 

3.1 ALS-like minimization with fixed TT-ranks 

The MPS formalism was proposed in Quantum Physics, where the representation ((T]) was 
used for the minimization of the Rayleigh quotient (x, Ax)/(x, x). Similarly, the solution of 
a linear system Ax = y with A = A* can be sought through the minimization of an energy 
function 

J(x) = ||x* — x|| A = (x, Ax) — 29 £ t(x,t|) + const, (4) 

where x* denotes the exact solution. We consider the Hermitian matrix A = A* and the 
right-hand side y given in the TT-format (©, and solve the minimization problem with x 
sought in the TT-format ((D with fixed TT-ranks r , i.e., X* = arg min XeTr J (f (X) ) .This heavy 
nonlinear minimization problem can hardly be solved unless a (very) accurate initial guess 
is available (see, eg. ||35||). To make it tractable, we can use the alternating linear optimiza- 
tion framework and substitute the global minimization over the tensor train X G T r by the 
linear minimization over all cores X' 1 ', . . . , X' d ' subsequently in a cycle. Solving the local 
problem we assume that all cores but k-th of the current tensor train X = (X' 1 ', . . . , X' d ') 
are 'frozen', and the minimization is done over X^ as follows 

X new = (X' 1 V . . , X^ 1 5 , XW, X^ 1 ),..., X< d > ) , where 

=argminJ(T(X)), s.t. X w eC Tk -' xntXri . (5) 

x< k > 

Clearly, the energy function does not grow during the sequence of ALS updates and the 
solution will converge to a local minimum. 

To write each ALS step as a linear problem, let us stretch all entries of the TT-core m 

the vector x k ( cx^-i ikCtk) = xL k k '_, >aic (ik) ■ From ((T]) we see that x = X^ k Xk, where X^ k = CP^k(X) 
is the ni . . .rid x r^n^r^ matrix defined as follows 

wu, = ft ) . . . x^U , (W-i )6(ifc, iOXg*^ , (Wi ) • • • xL d J , ^ ^ 

T (6) 

X^ k = 7^{X) = X <k ® I nk ® (X >k ) , 
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Figure 1: Tensor network corresponding to the quadratic form (Ax, x) with matrix A and 
vector x given in the tensor train format. The boxes are tensors with lines (legs) denoting 
indices. Each bond between two tensors assumes a summation over the join index. 

where 5(1, j ) is the Kronecker symbol, i.e., 6(i, j) = 1 if i = j and 6(i, j) = elsewhere. If 
J(t(X)) is considered as a function of x k , it is also the second-order energy function 

J(t(x)) = (AX^x k ,l /k x k ) -2(u,X^ k x k ) = (l^AX^x b x k ) - 2(X^ k y, x k ), 

where the gradient w.r.t. x k is zero wherfj] 

(X^ k AX^ k )x k = X^u. (7) 

The solution of the local minimization problem (0) is therefore equivalent to the solution 
of the original system Ax = y in the reduced basis X^ k = CP^ k (X), defined by ©. 

The tensor train representation (d) is non-unique. Indeed, two representations X and 
Y map to one tensor t(X) = t(Y) as soon as 

Y (k) (i k )=H k ! 1 X (k) (i k )H k , k=l,...,d, 

where H = H d = 1 and H k G C TkXTk , k = l,...,d— 1, are arbitrary nonsingular matrices. 
Given a vector in the TT-format x = t(X), any transformation H = (H , . . . , H d ) does not 
change the energy level since J(t(X)) = J(t(Y)) but gives us some flexibility for the choice 
of the reduced basis since lP^ k (X) ^ TVk(Y) . The proper choice of the representation X essen- 
tially defines the reduced basis and affects the properties of the local problem 0. A promi- 
nent transformation H is the TT-orthogonalization algorithm proposed in [30 J. It chooses 
matrices H k applying the QR factorization to the reshaped TT-cores, i.e., matrices of size 
r k _i x n k r k and/ or r k _i n k x r k . The transformation H given by the TT-orthogonalization im- 
plies the left-orthogonality constrains on TT-cores Y' 1 ', . . . , Y' k ^ and right-orthogonality 
on Y' k+1 ', . . . , Y' d \ which results in the orthogonality of the interfaces Y <k and Y >k and 
hence the reduced basis V^ k = T^ k (Y). Such a normalization step will be assumed in many 
algorithms throughout the paper; in most cases we will do this without introduction of a 
new representation Y just by 'claiming' the necessary orthogonalization pattern of the TT 
representation we use. If the reduced basis method is applied and such a representation 

1 For illustration see Fig. [T]. for the detailed derivation see I18ll9l. 
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X is chosen so that X^ k = CP^ k (X) = 7 is orthogonal, the spectrum of the reduced ma- 
trix CP* ACP lies between the minimum and maximum eigenvalues of the matrix A. Indeed, 
using the Rayleigh quotient ||19||, we write 

A min (CP*ACP) = min(CPv, ACPv) = min (u, Au) ^ min(u, Au) = A min (A), 

||v||=1 uespanCP,||u||=l ll u ll=l 

and similarly for the maximum values. It follows that the reduced matrix is conditioned 
not worse than the original, cond(X*^ k AX^ k ) ^ cond(A). Therefore, the orthogonality of 
TT-cores ensures the stability of local problems and we will silently assume this for all 
reduced problems in this paper. 

To conclude this part, let us calculate the complexity of the local problem 0. As was 
pointed out in [9], either a direct elimination, or an iterative linear solver with fast matrix- 
by-vector products (matvecs) may be applied. If the direct solution method is used, the 
costs which are required to form the r k _ 1 n k r k x r k _!n k r k matrix of the local problem are 
smaller than the complexity of the Gaussian elimination, i.e., the overall cost is 0(n 3 r 6 )J^l 
If an iterative method is used to solve the local problem, one multiplication X*^ k AX^ k re- 
quires 0(nrAT 3 + u 2 r^r 2 ) operations, where r and Ta denote the TT-rank of the current 
solution x and the matrix A, respectively. Careful implementation of the matvec is essen- 
tial to reach this complexity, see H for details. The complexity of the normalization step 
is only 0(dnr 3 ) operations and can be neglected. 



3.2 DMRG-like minimization and adaptivity of TT-ranks 

In practical numerical work the TT-ranks of the solution are usually not known in ad- 
vance, which puts a restriction on the use of the methods with fixed TT-ranks. The under- 
estimation of TT-ranks leads to a low accuracy of the solution, while the overestimation 
results in a large computational overhead. This motivates the development of methods 
which can choose and modify the TT-ranks on-the-fly adaptively to the desired accuracy 
level. A prominent example of such method is the Density Matrix Renormalization Group 
(DMRG) algorithm [42], developed in the quantum physics community for the solution of 
a ground state problem. DMRG performs similarly to the ALS but at each step combines 
two succeeding blocks X (k) and X (k+1) into one superblock 

w k (a k _,i k i k+1 a, +1 ) = W( k J_ i)Kk+i (i^), W (k) (W7) = X (k) (i k )X (k+1) (i k+1 ), (8) 

and make the minimization over w k . Classical DMRG minimizes the Rayleigh quotient, 
our version minimizes the energy function T(x), see ||18tl9f. Similarly to ©,((Z|) we write 
the local DMRG problem Bw k = g k as follows 

y = 3Wn}(X) = X <k ® I nk ® I nic+1 ® (X> k+1 ) T e c^- n ^-^^\ 
B = CP* ACP, g k = Ty e C Tk - inknk+,Tk+1 . 

When the w k is computed, new TT-blocks are obtained by the low-rank decomposition, 
i.e. the right-hand side of © is computed and the k-th rank is updated adaptively to the 
chosen accuracy. The minimization over 0(n 2 r 2 ) components of w k leads to complexity 
0(n 3 ), and seriously increases the computational time for systems with large mode sizes. 

2 We will always assume that ni = . . . = rid = n. and n = . . . = ra-i = t in the complexity estimates. 
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3.3 One-block enrichment as a Galerkin reduction of the two-dimensional 
system 

Suppose that we have just solved and updated the TT-block X (k) . Before we move to the 
next step, we would like to improve the reduced basis T^k+i (X) by adding a few vectors 
to it. Denote the current solution vector by t = t(T) and suppose we add a step s = t(S). 
Then the updated solution x = t+ s has the TT-representation x = t(X) defined as follows 



X™(ii) 
XW(ip) 



T^(ii: 
TW(ip 








X Cd), 



id J 



S (d) (id) 



(10) 



where p = 2, . . . , d — 1 . We will denote this tensor train as X = T + s|l The considered 
update affects the solution process in two ways: first, naturally, adds a certain correction 
to the solution, and second, enlarge the reduced basis that we will use at the next step of 
the ALS minimization. Indeed, it can easily be seen from definition ((2]) that 



X 



<k 



[T< k S <k ], (X >k ) T = [(T> k ) T (S >k ) 



From © we conclude that + S) = [T <k S <k ] ® I ® [(T >k ) T (S >k ) 



and hence 



^k(T + S) 



^k(T) S <k ®I®(T >k ) T T <k ®I®(S >k ) T 9^[S) 



(11) 



The clever choice of s = t(S) allows to add the essential vectors to spanCP^ k (T + S) and 
therefore improve the convergence of ALS. 

A random choice of s G T, with some small TT-ranks r (cf. random kick proposed 
in ||37ll29| ) may lead to a slow convergence. It also introduces an unwanted perturbation 
of the solution. A more robust idea is to choose s in accordance to some one-step iterative 
method, for instance, take s ~ z = y — At and construct a steepest descent or minimal 
residual method with approximations. This choice allows to derive the convergence esti- 
mate similarly to the classical one and will be discussed in Sec. HI 

To stay within methods of linear complexity, we restrict ourselves to zero shifts s = 
with a simple TT-structure S = (0, . . . , 0, S (k ', 0, . . . , 0) . The tensor train X = T + S has the 
following structure^ 



XM(i k )= [T< k '(i k ) S< k )(i k )], X( k+1 »(W 



T< k+1 '(W 




(12) 



and X' p ' = T' p) for other p. Note that since s = 0, the enrichment step does not affect 
the energy J(t(X)) = J(t(T)). Therefore, we can choose S [k) freely and develop (proba- 
bly, heuristic) approaches to improve the convergence of our scheme. The reduced basis 



3 Due to the non-uniqueness of the TT-format other representations (probably with smaller TT-ranks) 
can exist for x — t + s. 

4 We give a description for the forward sweep, i.e. the one with increasing k = 1 , . . . , d. For the backward 



sweep the construction is done analogously. 
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Figure 2: Linear system Ax = y in the reduced basis CP^ k+1 (X) shown by tensor networks. 
The reduced system has r k n k+ ir k+ i unknowns, shown by the dark box. Gray boxes show 
the X^ k ' which is updated by to improve the convergence. White boxes contribute to 
the local matrix B and right-hand side g of the 2D system d9). 

X^k+i = y^k+i (T + S) depends on the choice of S (k) as follows (cf. ©) 

X^ic+i(ii . . .id? a.kjk+1 (X-k+i ) = X <k (ii . . . , a^-i jX^ ^^fikjSfik+i Jk+i )X >k+1 (ai<+i, ik+2 • • • i-d) 

X/w = KL ® si i nk+1 ® (x >k+1 ) T , 
x< k) , = r 



<*k-1 



T (k) n(k) 
1 «ic— 1 J «k-l 



(13) 

where X^ k is a column of X <k and X„ k ' is the u k x r k matrix which is the slice of 3-tensor 
X [k) = [X (k) (<Xk-i,ik, 0Ck)] corresponding to the fixed <x k -i, and similarly for S k (<x k _i) and 
T' k ^(ak-i). Below we will write the local system (J7J) at the step k + 1 and see how it is 
affected by the choice of S tk) . 

The two-dimensional system defined by (0) is shown by gray boxes in the Fig. |2l It 
appears here as the local problem in the DMRG method, but in the same framework we 
may consider the whole initial system with d = 2, and k = 1 , depending on what type of 
analysis we would like to perform. 

Now the reduced system for the elements of Xk+i ( |3 k j k+ i Pk+i ) = X tk+1 ' ( (3k, jk+i > Pk+i ) 
writes 

A ak;Q B Qb)Q / b /A a , >(3k Ap k)b/ -A akQ b Q ,b, 
[X^ ® I] * B [X™ ® I] x k+1 = [X( k '®l]*g, 
where the following multi-indices are introduced for brevity of notation 



ftk-iT-k = a, flk-ijk = q / > q, q' = l,...,r k _ 1 n k , 
i-k+i c^k+1 = b, jk+iSk+i=b', b, b = 1, . . . ,r k +|U k+ i, 
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and X (k) G C Tk - inkXrk , I = I ric+ ,n k+ , • The system ((14)) has r k Ti k+ ir k+ i unknowns. At the same 
time it is the reduction of a 2D system Bw = g which has r k i n k n k+ i r k+1 unknowns. There- 
fore, the choice of the enrichment S' k ' (as a part of X' k ') can be considered as a cheaper ap- 
proximation of the 2D system solution. Taking into account the structure of X' k ' from ((131) 
we rewrite ((141) as follows 



[T S}* B [T S] x k+1 = [T S]* g, T = T (k) ® I, S = S (k) ® I. (15) 

The system ((15)) is difficult to analyze. However, we may propose a certain approxi- 
mation to its solution, and estimate the quality of the solution to the whole system ((TB]) 
via the properties of the approximation. Namely, let us consider the zero-padded TT-core 
in (|12[) as the initial guess, i.e., some information about the solution x k+ i that we want 
to use. For instance, we can apply the block Gauss-Seidel step, restricting the unknown 
block to the form 



XOc+1), 



V(i k+1 ) 



t(cx k i k+ ia k+ i 
v(cx{.'i k+1 <x k+ i; 



T (k+1) ( . 



= V, 



(16) 



Then ((15)) writes as the following overdetermined system 

B [Tt + Sv] = 



and following the Gauss-Seidel step we solve it considering only the lower part 

(S*BS)v = S*(g-BTt). (17) 

Equation ((17) is a Galerkin reduction method with the basis S applied to the system 
Bw = g with the initial guess ©, and TT-cores X [k) and X [k+1) defined by CO). After CE3 



is solved, the updated superblock W^ w writes as follows 

[T« 



X (k) (i k ] 



w (k) 



.T-kT-k+1 . 



UkJ 



A new 



V(i k+1 ) 



T [k) (i k )T (k+1) (ik + i) + S (k) (ik)V(i k+1 ; 
W Ck Htki^i) + S M (i lc )V(i k+1 ), 



(18) 



which allows to consider the proposed method as a solver for the 2D system, which per- 
forms the low-rank correction for the superblock rather than recompute it from scratch. 

Equations (fM)) and ((T7|) can be considered as certain approximate approaches to the 
solution of the 2D system ([9]). Different such approaches can be collected into Table [TJ 
sorted from the highest to the lowest accuracy. 



4 Steepest descent schemes 

4.1 Steepest descent with perturbation 

Given the initial guess t, the steepest descent (SD) step minimizes the energy function ((U) 
over vectors x = t + sot, where the step is chosen as follows 

s = - grad J(t) = y - At = z, 



a = arg min J (t + zoc) 



(z, Az) ' 
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V_. UI I L y It Al L y 




J(k) 


S M 


T(k+1] A/ 

1 V 




DMRG © 


optimize 


optimize 




AMEn (dU) 


keep 


choose 


optimize 


0(r 2 n) 


DMRG correction 


keep 


optimize 


keep optimize 


0(p 3 rn) 


Galerkin correction ([T7| 


keep 


choose 


keep optimize 


O(prn) 



Table 1: Comparison of different solution methods for a two-dimensional system ((D with 
blocks given by (fT8l> . We may keep the block from the previous iteration, choose it arbitrary 
(eg., using quasi-optimal or heuristic choice) or optimize solving the reduced system. In 
the complexity estimates, r is typical rank of X and p is typical rank of S. 

The solution after the SD step satisfies the so-called Galerkin condition (z, y — Ax) = 0. The 
progress of the SD step can be analyzed in terms of A-norms of errors c = x* — t and 
d = x* — x as follows 

llzll 2 ||z|| 2 
x = t + z 7 , d = c — z „ =(I — R z )c. 

IMIi II z IIa 

This gives interpretation in terms of projections and proves the monotone decrease of the 
energy function Ja(x) = ||d|| A ^ ||c|| A = JaM. To estimate the convergence rate, we write 



| d||l = (c, (I - R Z )*A(I - R z )c) = (c, A(I - R z )c) = cu 2 ||c||i, cu 2 



(c,(I-RJc) A 



(c,c) A 

The convergence rate uj z is therefore a square root of the Rayleigh quotient for I — R z in 
the A-scalar product. It can be bounded using the Kantorovich inequality | |20 || as follows 

2_-| ( z > z ) ( z > z ) < A max -A min \ 2 

where A max and A mln denote the largest and smallest eigenvalues of A, respectively. 

The residual z = y — At of the steepest descent method can not be computed exactly 
for high-dimensional problems. Suppose that it is approximated by z and the perturbed 
SD step is applied as follows 

llzll 2 ~ llzll 2 
x = t + z- ! ^ L r , d = c-z- ! ^ L r = (I-Rz)c + R 2 (c-c), (20) 

IMIa II z IIa 

where Ac = z. We further restrict ourselves to the perturbations of the following form 

z = z + 5z, (z, 5z) = 0, ||5z||a ^ £||z||a ^ £||z||a, (21) 

which will appear naturally in our algorithms for higher dimensions. For such perturba- 
tions the second term vanishes, R z (c — c) = 0, and the perturbation of the SD step writes 
through the perturbation of A-orthogonal projectors as follows 

d-d = -(R z -R z )c. 

A comprehensive overview of the perturbation theory for projections, pseudo-inverses 
and least square problems can be found in ||39||. Rather than adapting their results to the 
case of A-orthogonal projectors, we will develop a more accurate estimate for d — d using 
specifically the perturbations (|2T]> . 

11 



Theorem 1. For z given by (|2~Tj) the progress of the perturbed SD step (|20]) writes as follows 

1 



< cu z ||c|| A , cu z = cu z + ey 2(1 — cu 2 ] H cond (A), 

2v 2 

where cu z is the progress of the unperturbed SD step given by (fT9)) . 

Proof. For z = z + 6z the following simple identity can be verified from definition 

R z - R z = ^Mrll - R Z )A + (I - Rz)^A. 
IMIa ll z lli 

The perturbation of the SD step d — d = (R z — R z )c writes 

~ A A ~ \\ z \\ 2 , ~(P)Z) 

IMIa IMIa 

where p = (I — R z )5zand p = (I — R z )5z. Obviously, ||p||a ^ \\&z\\a ^ e|M|a- To estimate 
the A-norm of the second term, we write 

-7* 7,7* A. St. II zlP 

(z,p) = z*(I - Rz)6z = z*Sz - = ||5z|| 2 - f-Uz, 5z) A 

IMIa IMIa 

= ( A -1 6z — yz, 6z)a, 

wherey = ||z|| 2 /||2:||i.Then|(p,z)| < HA^Sz - yz|| a ||6z|| a and 

||A- 1 5z-yz|| 2 A = ||A- 1 5z||i-2y(5z,z)+y 2 ||z|| 2 A =y 2 ||z||i+||6z|| A _ 1 -2y||5z|| 2 
^y 2 ||z|| 2 +||6z|| 2 _,. 

Since p and z are A-orthogonal, we write 

l|d-d|| 2 = ||p||iy 2 + ^^ ^ £ 2 ||z|| 2 y 2 + £ 2 (y 2 ||z|| 2 + ||6z||*_ 1 ) = e 2 (2K + ||6z|| 2 A _,) . 

II z IIa V II z IIa / 

Finally, we estimate 

\\~d-d\\ A R l|z|| 2 jNlj-iNU ^ , 3 cond 2 (A) 

,, ,, ^ £V2 + £ -= - : ^ £a/2 1 - UOt) + £° = , 

I|c||a IM|a||c||a 2y2||z|| 2 ||z|| A - 1 2V 7 ! 

where the last inequality is based on 

II u IIa-i ^ A^ 2 ||u|| ^ a^J|u|| a> ||u|| a -i > A^Hull ^ A^J^U, cond(A) = ^L. 



Since || d|| a ^ II d — d|| a + || d|| a, we obtain the statement of the theorem. □ 

Remark 1. If cu z < 1 there exists £* > such that for all < £ < £* it holds cu z < 1 . 
This critical value £* ( k, cu) is the real positive root of the cubic equation cu z (£) = 1 , where 
k = cond(A) and cu act as parameters. The minimal value of £*(k, cu) for cu ^ (k— 1 )/(k+1 ) 
and k — > oo behaves - 1 +0(k- 3 / 2 ). 
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4.2 Steepest descent in two dimensions 



Consider the two-dimensional linear system Ax = y written in the elementwise notation 
as follows^ 



A(iii 2 , hhMhh) =yfak), U, ji = 1, • • ■ ,tvi, i 2 , h = 1, • • • ,n- 2 . 

As previously, we assume A and y to be given, and x to be sought in the following low-rank 
decomposition format 

A(ii£,WD = A^JiK 2 ^), A (p) = [Ai P) (ip,j P ]] 6 C^ xr -, 

y(i^)=yS 5 1) (ti)y[ 5 2) (i2), = EyW^)] g C^ X1 \ (22) 

where p = 1,2. Given the initial guess t in the same format, we compute the low-rank 
approximation of the residual z ~ z = y — At as follows 

z(VQ = z' 1) (i 1 )z< 2) (i 2 ), Z (1) = [z^(ii)] G C n ' xr % Z (2) = [z< 2) (i 2 )] G C n2X1 \ 

Following the perturbed SD algorithm, we can write the updated solution x = t + zot in a 
form 

xG^ = [TN(ji) Z"J(jO] [^fj, (23) 

and optimize by the step size a. Recalling the considerations from Section 13.31 we can 
consider more efficient optimization steps listed in Table [TJ For example, the solution of 
DMRG system (0) corresponds to the exact solution of the considered 2D system. We 
will particularly consider the Galerkin correction framework, i.e., will optimize over the 
bottom block of X' 2 ', denoted as V in (fTH)) . This is the cheapest method in Tabled} and all 
other methods have better convergence properties. 

In the proposed method we choose the step x = t + Zv where 

Z = Z m ® I n2 e c n ' n2Xr ^ 2 , v(Tx 2 ) = V c (i 2 ), 

and without the loss of generality assume the orthogonality of Z. Minimization of the 
energy function J(x) over v leads to the set of Galerkin conditions Z*(y — Ax) = and the 
step writes as follows 

x = t + Zv, (Z* AZ)v = Z*z. (24) 

Note that if we restrict ourselves to the perturbations z = z + 5z such that Z*6z = 0, it 
holds 

v = (Z*AZ) _1 Z*z = (Z*AZ)- 1 Z*z. 

Then the accuracy of the proposed method can be estimated similarly to the standard SD 
step d = c — Zv = (I — R z )c,and the progress of this step writes 

IMII2 „,2|M|2 ,„2 (c,(I-Rz)c) A 

l|d|| A = u> z II c IIa> w z = 7—\ • (25) 

Since z G span Z it follows that cu z ^ tu^, i-e v the convergence of the proposed method (f24)) 
is not slower than the one of the perturbed SD step (|20)) estimated in Thm. [U 



5 We consider x and y as vectors and at the same time as two-dimensional arrays x = [x(ji, )z)\ an d 
y — [y{U , ij)] with the same entries. We will switch freely between these representations without change 
of a notation. 
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Remark 2. When span Z = C n,Tl2 we converge in one iteration, i.e. cuz = 0. For large Z s.t. 
z G span Z we can expect cu z *C cu z . In general, however, the inequality cu z ^ uj z is sharp. 
To show this, consider Z = [z s] with (z, s) = 0. It is easy to show that 

1 f| ,2 _ WWk _ 1 l(s,z) A | 2 

z Nli-^llslliNli-K^zUF)' i-c^l NlilNIi" ' 

which proves cuz ^ cu z . However, the ratio can be equal to one when (s,z) A = and 
(s, z) = simultaneously. It can happen, eg. if s is an eigenvector of A. Similarly, if there 
is a k-dimensional invariant subspace of A which is orthogonal to z, we can form Z = 
[z si ... Sic] from the basis vectors of this subspace and have the same convergence cu z = 
iv z as the SD step does. 

To find the correction term v we have to solve the reduced linear system size r z U2, 
which writes as follows 

Bv = g, B = Z*AZ, g = Z*z. (26) 

Suppose that n 2 is still too large for the system to be solved exactly and we find the ap- 
proximate solution v k v t = B ] g. The simplest idea is to solve the reduced problem 
by the standard SD method. The following theorem estimates the progress of such 'lazy' 
approach. 

Theorem 2. Consider the system Ax = y with the initial guess t and error c = x* — t. After 
one outer step of SD (|24)> and one inner step of SD applied to the reduced problem (|26)> , 
the error d = x* — x writes as follows 

d=((I-R 2 ) + Z(I-Q g )Z*R z )c, 
= ((I-Rz) + (I-R z )R z )c, (27) 

l|d|| A = (c4 + 0-a> z ) 2 cu 2 g ) ||c||i, 
where Q g is the B-orthogonal projector on g, and w g < cu z . 

Proof. If v is the obtained (approximate) solution of (|26"1> , the progress of the step (|24)> is 

d = x*-x = c-Zv = (I - R z )c + Z(v* - v), 
H| A = || x* -x|| A = ||(I — R z )c|ft + ||Z(v* -v)\\\ = u) 2 z \\c\\ 2 A + ||v* -v||b, 



where in the last line we use the A-orthogonality of the two terms. The initial guess for v 
is zero, and after one step of the SD applied to (|26"1> the error is 

v, - v = (I - Q g ) (v, - 0) = (I - Q g )B- ] g = (I - Q g ) (Z* AZ)" 1 Z*z. 

The first line of the theorem now follows by the definition of Rz. To prove the second line 
it is enough to note that 

_ Zgg*Z*AZZ* _ zz*AZZ* _ 
^ 9 g*Z*AZg z*Az ~ ~ z ' 

and ZZ*R Z = Rz- The progress of the inner SD step is ||v* — v|| B = cu g ||v*||B, where 

I|v*|| 2 b = IIB-^Hb = l|Z*z|| 2 B _, = ||rz|| 2 B _, = (z^trAZJ-Tz) = (c,R z c) A = (1-a> 2 z )||c||i. 
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Substituting these estimates to ((28)) we obtain the second claim of the theorem. 
Now we prove that cu g ^ cu 2 . Similarly to (fT9)) we have 

w z _ [v*2 (i- Q 9 )vJb _ 1 llgll 4 



9 (v*,v*) B (g,Bg)(g,B- 1 g)" 

Since Z is orthogonal, ||g|| = ||Z*z|| = ||z||. It also holds that (g, Bg) = (Zg,AZg) = (z, Az). 
Finally we show that 

(g^g) = (z,Z(Z*AZ)- 1 Z*z) = (A _1 z, R z A _1 z)a ^ HA^zlli = (z,A~ 1 z), 

which completes the proof. □ 

The second term of (f28)> can be written also as follows 

ZK - v) = Z(I - Q g )v, = Z ( I - ^) B- 1 g = ZB 1 Z* z - j^z 

V g B gy IIqIIb 



ZB 1 Z*z - j^z = R z c - R z c, 

IIzIIa 



which gives d = (I — R 2 )c. This shows that the combination of one outer and one inner SD 
step is equivalent to the SD step with perturbation ([20)) . This is also easily seen from the 
structure of our inner-outer method itself. Indeed, in the outer step we add components 
ZP' to the basis set and in the inner step we add components of the inner residual g = 
Z*z = z 2 , where z 2 contains the elements of Z (2 ' stretched into one vector. Therefore, the 
described inner-outer scheme is equivalent to one 'global' SD step. 

The idea behind Theorem|2]is of course not to prove a slightly worse estimate in a more 
complicated way. In the recursive algorithm the second term in (|28)> will be obtained by 
the SD step followed by further optimization which will decrease the error of the reduced 
problem and consequently the total error. The SD step is therefore required as an ini- 
tial guess for which we can provide a theoretical estimate of convergence. The practical 
convergence that we expect is of course better than the upper estimate in ((27|) . 

Remark 3. Regarding the spectrum of reduced problems, the following two-side inequal- 
ity is proved in |28H 

(LTAU)- 1 ^ ITA-'U ^ (A ^ l + Amax)2 (U*AU)- 1 , 

where U is unitary matrix and B ^ C means that B — C is positive definite. The last 
inequality used in Theorem |2] follows from the left part of this inequality (which is itself 
rather elementary). 



4.3 Greedy descent method 

In higher dimensions we can further improve the steepest descent step by an ALS cycle 
over the step vector, as shown by Alg. [TJ This algorithm searches for max seTr J(t + s) 
using the ALS optimization and therefore can be considered as a greedy algorithm. The 
application of greedy algorithms to optimization in tensor formats was rigorously studied 
in EZHZH23. 
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Algorithm 1 x = t + ALS(z) 

Require: System Ax = y and initial guess t in the TT-format ([1]), approximate residual 

z = t(Z) G 7 t . 
Ensure: Updated solution x = t + v, v = t(V) G T r . 
1: for k = d, . . . , 1 do {Cycle over TT-cores} 

2: Find V (k) = argmin z(k) J(t + t(Z (1) , . . . , Z^ 1 ', Z (k) , V (k+1) , . . . , V (d) )) 
3: end for 

4: return v = t(V (1 >,..., V (d) ) 



Alg. Q] starts from the SD step with perturbation, and then the energy function is addi- 
tionally improved by an alternative minimization cycle. The combined progress is there- 
fore not worse than the one of the SD step, ||d|| A ^ cOz||c|| A , given by Thm. [TJ Another 
estimate is proven by the following theorem. 

Theorem 3. Consider the system Ax = y with the initial guess t and error c = x* — t. The 
step described by Alg. [T] returns the solution x = t + v such that the error d = x* — x is 
bounded as follows 



< vf f co? + (1 - co 2 )v 2 (co 2 + (1 - co 2 )v 2 (co 2 + . . . + v^tuiO . . 





(29) 



(c,c) A 



v k ^l. 



Proof. In 2D the statement of the theorem reads ||d|| A < vfcuf||c|| A .Itiseasytoseethatthe 
ALS update over Z' 2 ' gives exactly the two-dimensional SD step (|24|) with the progress 
cuz = cuz, = coi given by ((251) . The ALS update over ZP' further improves the energy 
function by the factor v 2 ^ 1 , which proves the statement of the theorem for d = 2. The 
base of the recursion is proved. 

After a microstep when Z' k+1 ' is optimized and becomes V' k+1 ', the solution writes as 
follows 

x k = t + Z <k v >k , Z <k G C n ' - niXrkUk+l - nd , v >k G C™ + ' ~ n * , 
where v >k = T(V< k+1 \ . . . , V™), i.e. v >k (<x k j k+1 . . . jd) = v££i, Ofc+i ) • • • (j d ), and 



Z^ k = 3^ k (Z) = Z^ k ® I nk+1 ® . . . ® I nd , 



^k(ii ■ ■ - id, <*kjk+i • • - jd) = Z^fiOZgJ (i 2 ) . . .Z^ a . (i k )6(ik+i, jk+i) • • -S(i d , j d )- 



sSk 

' In, . '.- . . . ', In ,, 

(30) 

, , . , . , li. 1 ftl ii. . i , li. . i I AIIj.IjI 

This equation is similar to the two-dimensional SD step (|24|) and allows to estimate the 
progress of Alg. fusing the result of Thm. |2] recursively. Following (|28)> , the progress can 
be written as follows 

II** - *1c||a / 2 , m 2 Jl v >k,* — v >k||i k j , 01 . 

^= co k + (l-tu k )— ) , (31) 



X* t|| A \ ll'^>k,* 0| 



A 



where A k = 2,^ k A2,< sk , z k = Z*^ k z and v >1Cj * is the exact solution of the reduced problem 
A k v >k = z k . Note that z k = 2^ k T(Z (1 \ . . . , Z (d) ) = T(Z (k+1 \ . . . , Z [d) ), so the inner SD steps 
will share the TT-factors of the same residual z. 
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To prove the recursion step, assume that the theorem holds in the dimension d — 1 , 
write (|3T]> with k = 1 and apply (|29)> for the second term as follows 



ll 2 / d~ ^ k— 1 k. \ (- > . 

V Hb < f V" A2TTM ^2,TT^2\ ^2_ n l V *>QS^V*J B 




where B = Z* AZ, g = Z*z, Z = Zi = Z (1 ' ® I® . . .® I, v* is the exact solution of Bv = g, Q g ^ k 
is the B-orthogonal projector on g<g k and S^k = y«:k(G) is defined for t(G) = g similarly 
to (|30)> . Since ZS^k = %ck+i/ and ||v*||b = ||c||a we have 



(v*,Q S<lc vJ B = (^ZS^O^Bg^J-'g^z) = (c,R w c) A , 

and d\ = uj k+ i ■ Similarly v k+ i = v k now defines the progress of the ALS microstep over 
the components of G' k ' = Z' k+1 '. Updating 7y> by the ALS step we reduce the error by the 
factor V] and write the total progress as follows 



X M2 



d k-1 




X LI| A \ k=2 j=2 

which completes the proof. □ 

Remark 4. Under the conditions of the theorem it holds ||cI||a ^ coa— i ||c||a- Indeed, after 
the first ALS microstep the solution has the form Xd-i = t + %s;d-iVd> see (|30l> . Comparing 
this to the steepest descent in 2D (|24)> we follow (|25l> and claim the convergence rate cju|_-, 
for x d _i and consequently for the result of Alg. [TJ due to the monotone convergence of the 
ALS. 

Remark 5. If ALS steps occasionally give no progress, i.e. v k = 1 , the progress <v of Alg.Q] 
given by (|29[) satisfies 

d-l 



w 2 = U - cut J . . . [I - cua_! J = X 1 1 1 - a*r ) ^ I - uj d _, . 

k=1 



It follows that in this case w 2 ^ cu^_ lr and the convergence estimate given by the previous 
remark is better than the one given by the theorem. If a sensible estimates for v k are 
available, we can plug them in (|29)> to estimate the combined progress of the SD and ALS 
steps. 



4.4 Non-greedy combination of the steepest descent and ALS 

Alg. [TJis a greedy-type algorithm. Such algorithms are likely to have a slow convergence 
or stagnate at some error level. To improve the practical convergence we can apply the 
ALS optimization to the whole solution vector x = t + zct, as shown by Alg. El 

Just like Alg. [TJ the non-greedy Alg. |2] starts from the steepest descent step and then 
improves the energy function by a number of ALS updates. Therefore, the progress of 
Alg.|2]is estimated by the one of the SD algorithm, || d|| A ^ a>2||c|| A . The better estimate of 
Remark|4]also applies to Alg.El i.e. ||d||A ^ u>d-i ||c||a- This follows from the fact that the 
optimization over X' d ' gives better energy function than the optimization over the lower 
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Algorithm 2 x = ALS(t + z) 

1: SetX = (X (1) ,...,X (d) ) =T + Z 

2: for k = d, . . . , 1 do {Cycle over TT-cores} 

3: Find Xi k e ' w = argmin x(k) }{x[X^\ . . . , X< k >, x£+ 1] , . . . , X^l)) 

4: end for 

5: return x = x(Xnew, • • • , xitlv) 



part of this TT-block V' d) , performed in greedy Alg.[H However, we cannot generalize the 
result of Thm.|3]for Alg.El since the non-greedy ALS update destroys the T+Z structure of 
the interfaces. The practically observed convergence of this method is nevertheless much 
better than that of the greedy descent method. More rigorous analysis of the convergence 
of ALS schemes can probably provide much better estimates for the convergence rate of 
the proposed algorithm. 

In the sequel we will develop a version of the algorithm which mixes the ALS and SD 
steps, following (|T4)), cf. line AMEn' in Tabled] For this algorithm it is possible to analyze 
the convergence recurrently similarly to Theorem (|29|) . The mixed AMEn version also has 
better convergence properties for the practical problems considered in 0. 

5 Practical implementation of tensor truncations 

Throughout the paper, we considered vectors, perturbed due to the tensor approximation. 
Now we highlight the practical features of this operation. 

The TT-rounding procedure [30J performs the recursive SVD-based truncations, which 
reduce the TT-ranks. The truncation of the k-th unfolding writes as follows, 

X {k} (i! . . .i k ,i k+ i . . .i d ) = U(i, . . .i k> a)ff(a)V*(a,i k+1 . . .i d ), 

where matrices U and V are orthogonal. The approximation algorithm returns 

X {k} = UU*X {k} , 5X {k} = (I - UU*)X {k} , 

where U contains the r first (dominant) vectors of U. It follows by the construction of 
the TT-SVD algorithm that (X {k} )*(SX {k} ) = 0, and therefore (x, 5x) = (t(X),t(5X)) = 0. 
We rely on this property for the residual approximation ((21) in the accuracy analysis of 
the perturbed steepest descent method, see Theorem [TJ The block version of the same 
orthogonality condition is used in the derivation of the two-dimensional steepest descent 
progress (|25)> . 

The SVD algorithm truncates a vector in the Frobenius norm, i.e. chooses the approxi- 
mation rank considering a sum of squared smallest singular values. To satisfy the accuracy 
assumption in (|2T]> we need to perform the accuracy control in the A-norms, ||6z|| A ^ 
An optimal approximation in the A-norms is a difficult problem. We can either truncate 

1 ^ 1 /2 

in the Frobenius norm and rely on the norm equivalence HxllA^ ^ ||x|| A ^ ||x||Amax, or 
follow the cheap heuristic strategy proposed in [9]. In the inner steps of the TT-rounding 
procedure, after the SVD is computed, we throw away the smallest singular values one by 
one, while the local error/ residual is below the tolerance, i.e. 

HXM-UIVII^a^ ^ e||X( k '||^ kA ^ k , or 
m k AOV k (XM-UIV*)|| ^ £||^ k A^ k XM||. 
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The basis enrichment step developed in our paper can only increase the TT-ranks of the 
solution. To make the procedure computationally feasible, we need to introduce a trun- 
cation step, which will reduce the solution ranks. To do this, we apply the TT-rounding 
procedure between the iterations, which perturbs the solution and can increase the en- 
ergy function. Therefore, the truncation accuracy has to be chosen accurately to provide 
the convergence of the methods with approximation. 

Assume that a step of the proposed method has the following progress, 

||x* - x|| A < £l||x* — t|| A . 

The progress after the approximation ||x — x|| A ^ £ x ||x|| A reads 

||x* - x|| A = ||x* - x + x - x|| A ^ Q||x* — t|| A + e x ||x|| A . 

While the energy function is large, the first term dominates for sufficiently small e x . In the 
end of the process, the perturbation error is comparable to the progress of the method, 
and the algorithm stagnates. We will see this in numerical examples. 

6 Numerical experiments 

Let us verify the methods proposed on a model example of symmetric positive definite 
system: 

-Ax = e, xeO = [0:1] d , x| ao = 0, 

where A is the standard finite difference Laplacian discretization on a uniform grid with 
the mode size 64 in each direction, i.e., the linear system has 64 d unknowns. The right- 
hand side e is the vector of all ones. Such a system arises naturally in the heat transfer 
simulation, or to precondition more complex elliptic problems. Note that the matrix and 
the right-hand side have exact low-rank representations, see |2"Tll31| . 
For different d we compare the following methods in Fig. |3j 

• the DMRG method presented in ("dmrg"); 

• the 2D SD method d2U> in a form x = t + 2^ d -iv d ("x = t + Zv"); 

• the greedy algorithm Q]("x = t + ALS(z)"); 

• the non-greedy algorithm ("x = ALS(t + z)"); 

• wherever possible, the standard (vectorized) steepest descend ("sd"). 

The TT-rank of the enrichment z was chosen p = 5, and the solution after each step was 
approximated with the relative truncation tolerance e x = 1 4 in the Frobenius norm. 

The convergence of the considered methods is compared in Fig. [31 A one-dimensional 
sweep is considered as one iteration, the progress of micro-iterations is also shown when- 
ever possible. We can make the following remarks based on the experimental results. 

• ALS steps sufficiently improves the convergence of all considered methods, i.e. the 
pessimistic assumptions of Remark [5] do not hold. A refined analysis of ALS conver- 
gence rates v k is still an open question. 
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Figure 3: A-norm of the error in different methods versus iterations (left), and CPU time 
(right). Dimension of the problem is d = 3 (top), d = 1 6 (middle), d = 64 (bottom). 
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• The convergence of non-greedy Alg.|2]is comparable to the one of the DMRG iteration- 
wise. However, the complexity of each DMRG iteration is cubic in the mode size, 
while the proposed methods have linear complexity This is clearly demonstrated 
in the right column, where the convergence is shown w.r.t. the computational time. 
The proposed methods time-wise are up to 1 00 times faster than the DMRG for this 
problem. 

• The one-step steepest descend method shows the slowest convergence, which is a 
direct consequence of the narrow (one vector) direction subspace. This indicates 
that the upper bounds of the convergence rate established in the paper might be 
seriously overestimated. 

7 Conclusion and future work 

In this paper we equip the ALS scheme with a basis enrichment step, which is chosen in 
accordance with the steepest descent algorithm. The resulted method demonstrates the 
convergence almost as good as the one of DMRG, while has the linear in the mode size 
and dimension complexity of ALS. Moreover, the global convergence rate is established 
similarly to the one of the steepest descent. Up to the best of our knowledge, this is the 
first result on the global convergence of a numerically efficient solution method for linear 
systems in higher dimensions. The proposed algorithm combines the advances of opti- 
mization methods in tensor formats (ALS, DMRG) with the ones of classical methods of 
numerical analysis. 

The proposed family of methods includes the algorithm with greedy-type step, for 
which the theoretical results obtained in the framework of greedy algorithms can be ap- 
plied. However, other algorithms developed in the non-greedy style also have proven 
convergence rate and manifest much better convergence in numerical experiments. 

The results of this paper can be developed in the following directions. First, the anal- 
ysis for the non-symmetric systems can be made similarly to this paper, substituting the 
steepest descent algorithm by the minimal residual method. The second Krylov vector 
is required in MINRES-type algorithms, which have to be approximated and the conver- 
gence of perturbed method should be discussed similarly to the Theorem [TJ Second, the 
complexity of the proposed methods w.r.t. tensor ranks should be studied and improved 
using faster (eg, cross) approximation schemes. Finally, we will develop and analyze the 
AMEn method for which the enrichment steps are mixed with ALS optimization, i.e., there 
is no explicit steepest descent step. 

The proposed algorithms are already applied to the solution of the chemical master 
equation in dimensions up to twenty [7], and more practical applications will follow soon. 
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